1 Executive Summary

1.1 Introduction:

The inherent benefits of the United States university education system attract students from across continents and cultures. A system that is largely independent of federal government regulation gives universities a free hand to offer diverse courses and research opportunities. Moreover, the employment opportunities during and after education are attractive, especially for students from developing countries. However, affordability is a major question for aspiring students worldwide. This data analysis and machine learning modeling project analyzes historical data from universities in 10 US states (basic university data and course type, student demographics, tuition and boarding fees, and pay details from university alumni) and predicts the cost for a prospective out-of-state/international student.

1.2 Purpose:

Knowing the cost of a university education is vital for any student planning their studies. The question is especially pressing for out-of-state students who want to pursue an education at a US university. This paper addresses the question in a scientific way using historical data on out-of-state students. The secondary objective of this analytics project is to make universities aware of the cost of education at their institution, so they can optimize it and keep it at a competitive level that attracts the best talent while keeping their finances healthy.

1.3 Methodology:

1.3.1 Data set:

Two datasets are provided for study and analysis. 1) Universities data: recent-year data for all universities, including university ranking and demographics, student demographics, tuition and room/board cost for both in-state and out-of-state students, and alumni data such as pay after graduation and the percentage of students who report making a difference in the world. We take universities from the given 10 states (New York, Massachusetts, Connecticut, California, Washington, Delaware, Alaska, North Dakota, New Jersey, Maryland) for analysis. 2) Tuition income data: historical price and cost data for different income levels. The net_cost column in this dataset gives the actual cost paid by the student after scholarships/awards. The average net cost for the most recent years is taken from this dataset and added to the universities dataset for extracting insights and for prediction.

1.3.2 Analysis Methodology:

Data analysis is done in a structured way, starting from extracting the dataset, through data pre-processing, data cleaning, exploratory data analysis, and visualization, to machine learning modeling. Data is extracted and transformed by selecting only the 10 states. A data summary is created to get a sense of the dataset and extract further insights from it. Insignificant data points are removed. Missing values are handled appropriately, either by a stratified mean imputer or by assigning a value based on domain knowledge of the particular variable. Additionally, the pattern in which the data is missing is considered when handling missing values. Exploratory data analysis is done by applying statistical methods. Data visualization and graphical representation of the dataset reveal additional patterns such as trends and relationships. Related data points are used in statistical hypothesis tests to assess the plausibility of the assumptions. Finally, regression and classification machine learning models are built to predict the out_of_state_total cost. Model accuracy and residuals are documented, and the best-fitting model is recommended for use and implementation.
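The stratified mean imputation mentioned above can be sketched in base R (the report itself uses dplyr's group_by/mutate for this step; the toy data frame and its values here are made up purely for illustration):

```r
# Stratified mean imputation: fill each NA with the mean of its own group
# (here, the state). Toy data only; column names mirror the report's dataset.
df <- data.frame(
  state          = c("NY", "NY", "CA", "CA", "CA"),
  room_and_board = c(12000, NA, 9000, 11000, NA)
)
df$room_and_board <- ave(
  df$room_and_board, df$state,
  FUN = function(v) ifelse(is.na(v), mean(v, na.rm = TRUE), v)
)
df$room_and_board  # NY's NA becomes 12000; CA's NA becomes the CA mean, 10000
```

Because the group mean is used instead of the overall mean, the imputed value reflects local (per-state) price levels rather than a single nationwide average.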

1.4 Analysis:

  1. Early career pay and STEM percentage are driving factors that increase the make-world-better percentage.
  2. More expensive universities have a higher number of women enrolled.
  3. Net cost is 15 to 25 percent less than the out_of_state_total cost due to scholarships and awards. This encourages students to apply for merit-based scholarships and improves enrollment.
  4. University rating and alumni status are the key factors behind expensive universities.

1.5 Recommendations:

  1. Room/board is a significant factor and should be kept at an optimal level to encourage students.
  2. Net cost is 15 to 25% less than the out-of-state total; keep net cost at an optimal level to keep the university's financial position healthy.

1.6 Conclusion:

Finding admission to a reputed university at a relatively low cost is time-consuming, tedious research. This model helps students predict and plan for their studies, and at the same time universities can use this machine learning model to keep their cost at an optimal level.

2 Data Mining

2.1 Problem 1

#Loading the universities dataset
data_university <-read.csv("universities.csv", header=T, stringsAsFactors = TRUE)
#Selecting only the required locations
data_univ <- filter(data_university,state %in% c("New York", "Massachusetts", "Connecticut","California","Washington","Alaska","North Dakota", "New Jersey", "Maryland","Delaware"))

#Loading the tuition_income dataset
data_tution <-read.csv("tuition_income.csv", header=T, stringsAsFactors = TRUE)
head(data_tution,2)
#Selecting only the required locations with state_code
data_tution <- filter(data_tution,state %in% c("NY", "MA", "CT","CA","WA","AK","ND", "NJ", "MD","DE"))

Note: The tuition dataset is selected with the required variables and joined here, before the train/test split, for use in Problem 6 and Correlation 2. Net_cost for the most recent year from the tuition_income dataset is useful for predicting the out_of_state_total cost. Net cost is the amount actually paid by the student after scholarships/awards; since it is reported for different income levels, the average net cost helps students understand the actual cost of attending a university. Net cost could also be categorized by income level to add additional predictor variables; however, taking the average gives a fair idea of the average cost to be incurred. Below, net_cost is extracted from the tuition_income dataset and added to the universities dataset.

head(data_tution,1)
#select the most recent year's data
data_tution_maxdate <- data_tution %>% 
  select(name, state,total_price,year,campus,net_cost,income_lvl) %>% 
  group_by(state) %>% filter(year == max(year)) %>% # Take the most recent date per state
  group_by() 

#take net_Cost mean for all universities
data_tution_cost <-data_tution_maxdate %>%
  select(name,total_price,net_cost)%>%
  dplyr::filter(net_cost != 0) %>%
  group_by(name)%>% 
  summarise(net_cost = mean(net_cost))
#add net_cost value to university dataset
#left join of university and tuition_cost datasets
data_univ <- left_join(data_univ, data_tution_cost, by = c('name'))
apply(is.na(data_univ),2, sum)
##                         X                      name          total_enrollment 
##                         0                         0                         0 
##                     state                  n_native                   n_asian 
##                         0                         0                         0 
##                   n_black                n_hispanic                 n_pacific 
##                         0                         0                         0 
##             n_nonresident          n_total_minority             n_multiracial 
##                         0                         0                         0 
##                 n_unknown                   n_white                   n_women 
##                         0                         0                         0 
##                state_code                      type             degree_length 
##                         0                         0                         0 
##            room_and_board          in_state_tuition            in_state_total 
##                       233                         0                         0 
##      out_of_state_tuition        out_of_state_total          early_career_pay 
##                         0                         0                       441 
##            mid_career_pay make_world_better_percent              stem_percent 
##                       441                       444                       441 
##                  net_cost 
##                        64

Details of the datasets:
Universities dataset (initially): 2159 observations and 27 variables
Tuition income dataset: 209012 observations and 7 variables
Universities data (after filtering 10 states): 563 observations and 27 variables

2.1.1 How many universities are in the resulting dataset?

nrow(data_univ)

There are 563 universities in the resulting dataset.
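Note that counting rows is not the same as counting distinct names: the skim output later in this report shows 562 unique names across the 563 rows (one name appears twice). A base-R sketch of the distinction, on toy data:

```r
# Rows vs distinct values: on a vector with a duplicate, the two differ.
names_vec <- c("Adelphi", "Amherst", "Amherst", "Berkeley")
length(names_vec)          # 4 rows
length(unique(names_vec))  # 3 distinct names
```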

2.1.2 Graduates of how many universities have an estimated early career pay greater than 50000 dollars per year?

Univ_grad_estPay <- data_univ %>% filter(early_career_pay > 50000)
nrow(Univ_grad_estPay)

There are 94 universities whose graduates have an estimated early career pay greater than 50000 dollars per year. It can also be observed that all of these universities are of the public or private institution type.

2.1.3 How many universities are not for-profit?

univ_nonProf <- data_univ %>% filter(type %in% c('Public', 'Private'))
nrow(univ_nonProf)

541 universities fall under the category of not-for-profit institutions, under the general assumption that only universities explicitly labeled "For Profit" are profit-making, so the Public and Private categories are treated as not for profit.

2.2 Problem 2: Making sense of your data

2.2.1 Take a quick look at the data structure. Describe the dataset in a few sentences.

The universities dataset lists US universities from all states and predominantly contains four components:
a) University location and enrollment size, with type, which includes specialization, ownership status, and the degree length offered
b) Annual enrollment count with students' demographic data
c) Fee and cost structure for both in-state and out-of-state students
d) Alumni status, which includes their pay and social impact

2.2.2 Which facts or principles do you need to know about the US university system to make a good sense of this dataset?

The US education system is relatively independent of government regulation compared to other countries. It is intensely diverse, and most universities accept students from all over the world. A few universities are run for profit, but private and public universities are largely not run for profit. In-state students receive tuition benefits in all states. Several merit scholarship options are available, which reduce the cost burden for students. Students from all continents prefer to study in the US because of a) the diverse course options, b) employment opportunities after study, and c) technological advancement and research opportunities.

2.2.3 Which things about the data you have at hand you do not know for sure and have to assume?

Several universities offer both Bachelor's and Master's degrees, which is not clear from this dataset; we assume that a university with a 2-year course offers only a Master's degree. We assume that 100% of the seats are available to all students irrespective of demography. "For profit" is explicitly labeled, so we assume the Private and Public categories are not for profit. We assume that if a STEM percentage value is present, the university offers STEM courses, which is not explicitly stated.

2.2.4 In this project, you will have to predict out-of-state total cost. Do students choose university solely based on the cost? Which other factors might be important?

Students choose universities based on more than just cost; other factors include
1) University ranking
2) State/city of the university
3) Type: private/public/for-profit
4) Alumni status (pay and social standing)
5) STEM and research opportunities
6) Number of international students

2.2.5 To whom is this cost variable more important than the other three? Explain.

Our target feature, out_of_state_total, is most important for out-of-state/international students, since it is the total cost they would actually face.

2.2.6 Formulate a reasonable organizational/business goal, execution of which would require analysis of out-of-state total cost in this dataset.

Optimize the out-of-state total cost to attract the best-fit students from across the world, while ensuring the university is sufficiently funded to support staff welfare and research & development.

2.2.7 Which variable would you have to optimize for this goal? What variables would you have to constrain for it to be reasonable?

The room and board variable needs to be optimized to achieve this goal. In addition, scholarships should support only purely merit-based students, to constrain the net cost.

2.2.8 Which data manipulations would you have to perform for an analysis or an ML system for this goal? What would you have to predict? Would classification or regression be more suitable, and why?

Data manipulation is the foundation of data analysis and machine learning. It includes cleaning the data; making it readable by transforming or computing additional variables; removing duplicates and outliers; ordering the data; removing less significant variables; summarizing and visualizing the data to draw inferences; applying statistical measures such as the mean and median; and renaming variables to make more sense of the data.
Here the total cost of a university education in the US is predicted for an out-of-state/international student. Regression is the more suitable approach, because we are predicting a continuous value, the total cost for out-of-state students, from the features available in the dataset.
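As a minimal, illustrative sketch of that regression idea (base R lm on fabricated numbers; only the column names come from this dataset, and the report's actual modeling appears later):

```r
# Fit a linear regression of out_of_state_total on two numeric predictors.
# The tiny data frame is made up so the snippet runs on its own.
toy <- data.frame(
  out_of_state_total = c(21000, 45000, 9800, 62000, 30500),
  room_and_board     = c(9000, 14000, 6000, 16000, 12000),
  net_cost           = c(15000, 30000, 8000, 40000, 21000)
)
fit <- lm(out_of_state_total ~ room_and_board + net_cost, data = toy)
# Predict the cost for a hypothetical new university
predict(fit, newdata = data.frame(room_and_board = 11000, net_cost = 18000))
```

The same formula interface extends naturally to more predictors, which is what the later modeling sections exploit.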

2.3 Problem 3: EDA and Data preprocessing

#Correcting the state_code for Washington:
data_univ <- within(data_univ, state_code[state_code == 'OH' & state == 'Washington'] <- 'WA')

2.3.1 Check how many categorical features exist and for each categorical feature see how many samples belong to each category.

#Performing summary statistics to analyse the categorical features:
skim (data_univ)
Data summary
Name data_univ
Number of rows 563
Number of columns 28
_______________________
Column type frequency:
factor 5
numeric 23
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
name 0 1 FALSE 562 Mid: 2, Aca: 1, Ade: 1, Adi: 1
state 0 1 FALSE 10 Cal: 194, New: 121, Mas: 78, Was: 45
state_code 0 1 FALSE 10 CA: 194, NY: 121, MA: 77, WA: 45
type 0 1 FALSE 3 Pub: 271, Pri: 270, For: 22
degree_length 0 1 FALSE 2 4 Y: 325, 2 Y: 238

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1.00 1058.44 631.69 4.00 484.50 1075.00 1597.50 2157.0 ▇▇▇▇▆
total_enrollment 0 1.00 6663.93 7522.65 20.00 1556.50 4160.00 9067.00 51237.0 ▇▂▁▁▁
n_native 0 1.00 30.55 54.05 0.00 4.00 14.00 35.00 549.0 ▇▁▁▁▁
n_asian 0 1.00 615.77 1227.18 0.00 42.50 167.00 582.50 10381.0 ▇▁▁▁▁
n_black 0 1.00 617.98 1240.94 0.00 85.00 238.00 666.50 18727.0 ▇▁▁▁▁
n_hispanic 0 1.00 1522.56 2708.62 0.00 127.00 372.00 1536.50 24235.0 ▇▁▁▁▁
n_pacific 0 1.00 22.58 42.30 0.00 1.00 6.00 26.00 405.0 ▇▁▁▁▁
n_nonresident 0 1.00 367.53 1027.45 0.00 13.00 72.00 230.00 11495.0 ▇▁▁▁▁
n_total_minority 0 1.00 3042.35 4397.38 0.00 379.00 1123.00 3670.50 29427.0 ▇▁▁▁▁
n_multiracial 0 1.00 232.91 336.50 0.00 31.50 102.00 319.00 2523.0 ▇▁▁▁▁
n_unknown 0 1.00 356.51 566.15 0.00 48.00 162.00 427.50 5111.0 ▇▁▁▁▁
n_white 0 1.00 2897.53 3183.93 0.00 676.50 1920.00 3927.00 23596.0 ▇▂▁▁▁
n_women 0 1.00 3678.43 4109.31 0.00 878.00 2367.00 5119.50 36116.0 ▇▁▁▁▁
room_and_board 233 0.59 12110.23 3764.53 30.00 10216.50 12835.00 14814.50 21300.0 ▁▂▅▇▁
in_state_tuition 0 1.00 19908.72 18309.31 1154.00 4506.50 11500.00 35754.00 59985.0 ▇▂▂▂▂
in_state_total 0 1.00 27007.07 23814.61 1154.00 4920.00 18160.00 49146.50 75003.0 ▇▃▂▂▂
out_of_state_tuition 0 1.00 23207.69 16018.95 3460.00 9546.00 16241.00 36158.00 59985.0 ▇▂▂▂▂
out_of_state_total 0 1.00 30306.05 21682.63 3460.00 10045.00 21286.00 49356.00 75003.0 ▇▂▂▂▂
early_career_pay 441 0.22 57638.52 9526.17 39900.00 50325.00 55100.00 63075.00 88800.0 ▃▇▅▁▁
mid_career_pay 441 0.22 104952.46 18503.68 68600.00 90000.00 101500.00 115225.00 158200.0 ▂▇▅▂▁
make_world_better_percent 444 0.21 49.73 7.87 35.00 44.00 49.00 54.00 82.0 ▅▇▃▁▁
stem_percent 441 0.22 21.38 20.45 0.00 9.00 15.50 26.00 100.0 ▇▃▁▁▁
net_cost 64 0.89 18172.69 10522.02 2077.55 9180.49 15753.05 26361.44 59655.5 ▇▅▃▁▁

The skim function provides an overview of the tidy dataset we have for analysis. The summary clearly shows each column type and its frequency: there are two column types, factor and numeric. The characteristics of each variable are also described, with the distribution of each numeric variable shown as a small histogram. There are 5 categorical variables in the dataset: name (562 unique values), state (10), state_code (10), type (3), and degree_length (2).

2.3.2 Visualize the distributions of all features in the data set and summarize your findings.

data_univ %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
  facet_wrap(~ key, scales = "free") +   # In separate panels
  geom_density()
## Warning: Removed 2064 rows containing non-finite values (stat_density).

From the above plots we can conclude that almost none of the variables are normally distributed; each is either right-skewed or left-skewed. A positively skewed (right-skewed) distribution means the most frequent values are low, indicating Mode < Median < Mean. A negatively skewed (left-skewed) distribution means the most frequent values are high, so Mode > Median > Mean. Normally distributed data has a skewness coefficient of zero.

#checking skewness of the target variable
skewness(data_univ$out_of_state_total, na.rm = TRUE)
## [1] 0.5279084

Here we check the skewness of the output variable, out_of_state_total. The skewness coefficient is 0.5279084, which means the distribution differs from a normal distribution. The larger the absolute skewness, the more the distribution differs from normal.

#Normalisation
DataExplorer::plot_qq(data_univ[1:10])

DataExplorer::plot_qq(data_univ[11:20])
## Warning: Removed 233 rows containing non-finite values (stat_qq).
## Warning: Removed 233 rows containing non-finite values (stat_qq_line).

DataExplorer::plot_qq(data_univ[21:28])
## Warning: Removed 1831 rows containing non-finite values (stat_qq).
## Warning: Removed 1831 rows containing non-finite values (stat_qq_line).

The QQ plot illustrates the normality of the data for each variable. Almost all variables are clearly non-normal, which hints that normalizing the data could be one approach to feature scaling.
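A hedged sketch of two such scaling options in base R, using toy values loosely based on the out_of_state_tuition quantiles reported above:

```r
# Two common options for a right-skewed variable:
# z-score standardization (mean 0, sd 1) and a log transform.
x <- c(3460, 9546, 16241, 36158, 59985)   # roughly p0..p100 of out_of_state_tuition
z <- as.numeric(scale(x))                 # standardized values
x_log <- log(x)                           # compresses the long right tail
c(mean = round(mean(z), 10), sd = sd(z))  # mean ~0, sd = 1
```

Standardization preserves the shape of the distribution (and hence its skew), whereas the log transform actually reduces right skew, which matters when a model assumes roughly normal inputs.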

2.3.3 Split the data set into a training (70%) and a test (30%) set with the help of stratified sampling based on the degree length. Report mean and standard deviation for out_of_state_total in train and test data set.

#Train test split
set.seed(101) 
split.index <- createDataPartition(data_univ$degree_length, p = .7, list = FALSE)
train_data <- data_univ[ split.index,]
test_data  <- data_univ[-split.index,]
dim(train_data)
## [1] 395  28
dim(test_data)
## [1] 168  28
# check the proportions
prop.table(table(train_data$degree_length))
## 
##    2 Year    4 Year 
## 0.4227848 0.5772152
prop.table(table(test_data$degree_length))
## 
##   2 Year   4 Year 
## 0.422619 0.577381
#Mean of the train dataset - out_of_state_total
print(mean(train_data$out_of_state_total))
## [1] 30179.43
#Mean of the test dataset - out_of_state_total
print(mean(test_data$out_of_state_total))
## [1] 30603.76
#Standard deviation of the train dataset - out_of_state_total
print(sd(train_data$out_of_state_total))
## [1] 21469.93
#Standard deviation of the test dataset - out_of_state_total
print(sd(test_data$out_of_state_total))
## [1] 22236.89

Data cleaning
Note 1: Data cleaning (Problem 7) is performed here for EDA, better interpretation, and visualization purposes.
Note 2: Justification for the missing-value handling method is given in Problem 7.

#Handling missing values

#Checking for the NA values
apply(is.na(train_data),2, sum)
##                         X                      name          total_enrollment 
##                         0                         0                         0 
##                     state                  n_native                   n_asian 
##                         0                         0                         0 
##                   n_black                n_hispanic                 n_pacific 
##                         0                         0                         0 
##             n_nonresident          n_total_minority             n_multiracial 
##                         0                         0                         0 
##                 n_unknown                   n_white                   n_women 
##                         0                         0                         0 
##                state_code                      type             degree_length 
##                         0                         0                         0 
##            room_and_board          in_state_tuition            in_state_total 
##                       168                         0                         0 
##      out_of_state_tuition        out_of_state_total          early_career_pay 
##                         0                         0                       311 
##            mid_career_pay make_world_better_percent              stem_percent 
##                       311                       314                       311 
##                  net_cost 
##                        45
#Mean imputer function:
mean_imputer <- function(x) {
  round(replace(x, is.na(x), mean(x, na.rm = TRUE)), digits = 0)
}

##Handling missing value for room_and_board by stratified mean imputer with respect to state.
train_data <- train_data %>% group_by(state_code) %>%
  mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
  group_by() #removing groupby

#There are NA values in the newly created net_cost variable (45 in this training split) due to university name mismatches during the join.
#Mean imputer - net_cost
train_data <- train_data %>% mutate_at(28, mean_imputer)

# Handle NA for early_career_pay, mid_career_pay, make_world_better_percent, stem_percent 
train_data[is.na(train_data)] <- 0
#Checking Near Zero Variance for the train dataset
train_data_nzv <- nearZeroVar(train_data)
train_data_nzv
## [1] 26

It can be observed that only one feature shows near-zero variance (column position 26, which corresponds to make_world_better_percent after the zero-fill). Since it is only a single feature, its effect can be considered negligible.
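caret::nearZeroVar returns column positions, so the 26 above is an index into the data frame. As a hedged base-R sketch of the underlying check (caret's documented defaults are a frequency ratio above 95/5 and under 10% unique values; the helper name and toy data here are ours):

```r
# Flag columns whose most common value dominates (frequency ratio > 19)
# and which have few unique values (< 10% of the rows).
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  which(vapply(df, function(col) {
    tab <- sort(table(col), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
    pct_unique <- 100 * length(tab) / length(col)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1)))
}
toy <- data.frame(a = c(rep(0, 39), 1), b = 1:40)
near_zero_var(toy)  # flags column "a" (position 1)
```

This also explains why zero-filling many missing values can create near-zero-variance columns: the fill value becomes the dominant level.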

#Data Cleaning for data_univ as this dataset will be used for data visualization and EDA.
#Treating missing value room_and_board with startified mean imputer
data_univ <- data_univ %>% group_by(state_code) %>%
  mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
  group_by() #removing groupby

#Treating missing value Net_cost with mean imputer
data_univ <- data_univ %>% mutate_at(28, mean_imputer)

#Replacing the remaining missing values (pay and percent variables) with 0
data_univ[is.na(data_univ)] <- 0
apply(is.na(data_univ),2, sum)
##                         X                      name          total_enrollment 
##                         0                         0                         0 
##                     state                  n_native                   n_asian 
##                         0                         0                         0 
##                   n_black                n_hispanic                 n_pacific 
##                         0                         0                         0 
##             n_nonresident          n_total_minority             n_multiracial 
##                         0                         0                         0 
##                 n_unknown                   n_white                   n_women 
##                         0                         0                         0 
##                state_code                      type             degree_length 
##                         0                         0                         0 
##            room_and_board          in_state_tuition            in_state_total 
##                         0                         0                         0 
##      out_of_state_tuition        out_of_state_total          early_career_pay 
##                         0                         0                         0 
##            mid_career_pay make_world_better_percent              stem_percent 
##                         0                         0                         0 
##                  net_cost 
##                         0

2.4 Problem 4: Data Visualization

#Selecting required variables for visualization purpose
data_univ_viz <- data_univ %>% 
  select( state
           , state_code , name , total_enrollment , n_native , n_asian , n_black , n_hispanic , n_pacific , n_nonresident , n_total_minority
           , n_multiracial , n_unknown , n_white , n_women ,room_and_board ,in_state_tuition ,in_state_total ,out_of_state_tuition
           ,out_of_state_total ,net_cost ,early_career_pay ,mid_career_pay ,stem_percent ,make_world_better_percent)
#Update short name and State Code for the universities
data_univ_viz <- data_univ_viz %>%
  mutate(univ_short_name= paste(str_extract(data_univ_viz$name,"(\\w+)"),data_univ_viz$state_code) )

2.4.1 Describe what can be seen and interpret it.

Observation and inferences are given below each visualization.

#Visualization 1 : To find outliers in the target variable
ggplot(data_univ_viz) +
  aes(x = "", y = out_of_state_total) +
  geom_boxplot(fill = "#0c4c8a") +
  ggtitle("Distribution plot for out_of_state_total") +
  theme_minimal()

The box plot shows the range of the target variable. We can observe that there are no outliers, and the values range approximately from 3,500 to 75,000.

#Visualization 2
data_univ_usmap <-data_univ_viz %>%
  select(state,state_code,out_of_state_total)%>%
  group_by(state_code)%>% 
  summarise(out_of_state_total = round(mean(out_of_state_total), 2))
#US Map with average out_of_state_total
plot_ly(
  type = "choropleth",
  locations = data_univ_usmap$state_code,
  locationmode = "USA-states",
  z = data_univ_usmap$out_of_state_total) %>%
  layout(geo = list(scope = "usa")) %>%
  layout(title = 'Given 10 US states with their out_of_state_total', plot_bgcolor = "#e5ecf6")

This choropleth map of the US shows the average out_of_state_total for each of the 10 states.

#Visualization 3
#Scatterplot - out_of_state_total and total_enrollment with respect to each state 
ggplot(data = data_univ) + 
  geom_point(mapping = aes(x = total_enrollment, y = out_of_state_total, color = state)) +
  ggtitle("Scatterplot - Total enrollment with out_of_state_total with respect to each state")

This scatterplot is colored by state. In almost all states, universities with lower enrollment have a high out_of_state_total. There are a few exceptions, such as New York, North Dakota, and California, where higher total_enrollment goes together with higher out_of_state_total.

#Visualization 4
#scatterplot3d:
attach(data_univ_viz) 
scatterplot3d(make_world_better_percent, stem_percent, out_of_state_total,
              pch=16, 
              highlight.3d=TRUE,
              type="h", 
              main="3D Scatterplot - STEM & Make world better",
              phi = 0,
              ticktype = "detailed")

A 3D scatter plot of the numerical variables. The x and y axes are make_world_better_percent and stem_percent, with the target variable out_of_state_total on the z axis. The data points rise with increasing out_of_state_total, which suggests a positive correlation. We can also observe that make_world_better_percent tends to increase as stem_percent increases, suggesting that graduates of STEM programs are more likely to report making the world a better place.

#Visualization 5
#Select top 5 and bottom 5 universities for out_of_state_total
data_high_Low <- data_univ_viz %>%
  select(univ_short_name,out_of_state_total)
#Top 5 and bottom 5 universities with respect to cost
data_univ_viz_high <- data_high_Low  %>% top_n(5)
## Selecting by out_of_state_total
data_univ_viz_low <- data_high_Low  %>% top_n(-5)
## Selecting by out_of_state_total
#Combine top and bottom universities
data_univ_viz_high_low<-bind_rows(data_univ_viz_high, data_univ_viz_low)
#Color for bar chart
coul <- brewer.pal(5, "Set2") 
barplot(data_univ_viz_high_low$out_of_state_total,
        main = "Top 5 and low 5 Universities - Out_of_state_total_fee",
        ylab = "",
        xlab = "Universities",
        names.arg = data_univ_viz_high_low$univ_short_name,
        col = coul,
        width=c(1),
        density=c(70),
        las=1,
        horiz = FALSE
)

The bar plot shows the most expensive and the least expensive universities within the given states.

#Visualization 6
#Bottom 5 least expensive universities with demographic variables
data_univ_viz_low<- data_univ_viz[order(data_univ_viz$out_of_state_total),][1:5,]
data_univ_viz_low_demo <- data_univ_viz_low  %>%
  select(univ_short_name,n_native, n_asian, n_black,n_women)
#convert to Pivot longer for plot
data_univ_viz_low_PL <- data_univ_viz_low_demo %>%
  pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
ggplot_all_param <- ggplot(data=data_univ_viz_low_PL, aes(x=univ_short_name, y=Value, fill=Index)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = coul)+
  theme_minimal()
ggplot_all_param + ggtitle("Low expensive Universities with Demographics features")

This bar graph shows the low-cost universities and how the demographic features are distributed among them. We can observe a high proportion of enrolled women.

#Visualization 7
#Top 10 most expensive universities - total enrollment with all demographic features
data_univ_viz_high<- data_univ_viz[order(-data_univ_viz$out_of_state_total),][1:10,]
data_univ_viz_high_demo <- data_univ_viz_high  %>%
  select(univ_short_name,total_enrollment,n_women,n_total_minority,n_native,n_asian,n_hispanic,n_pacific,n_nonresident,n_black,n_white,
         n_multiracial,n_unknown)
#convert to Pivot longer for plot
data_univ_viz_high_PL <- data_univ_viz_high_demo %>%
  pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
#Horizontal Plot
ggplot(data_univ_viz_high_PL[order(data_univ_viz_high_PL$Index, decreasing = T),],
       aes(Value,  y=reorder(univ_short_name, -Value), label= round(Value,2), fill=factor(Index, levels=c("total_enrollment","n_women","n_total_minority","n_native","n_asian","n_hispanic","n_pacific","n_nonresident","n_white","n_black",
          "n_multiracial","n_unknown")))) + 
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Category", labels = c("total_enrollment" ,"n_women" ,"n_total_minority" ,"n_native" ,"n_asian", "n_hispanic",  "n_pacific","n_nonresident","n_white","n_black","n_multiracial","n_unknown"))+
  scale_x_continuous(breaks=c(20000,40000,60000,80000,100000,120000,140000,160000,180000)) +
  ylab("Universities") + 
  ggtitle("Top10 Expensive Universities with Demographics features")

#Note: Applied log for all variables to fit them into scale for the bar chart visualization.

This bar graph shows the distribution of women and minority groups at the expensive universities; these two groups play a major part in the distribution. A major contribution can also be observed from the non-resident students. There is not much difference between the numbers of white and black students: both communities contribute almost equally at the expensive universities.

#Visualization 8
#Plotting categorical variable - Type
plot(data_univ$type,col = c("red", "blue","green"))

This plot shows the counts of each university type; public and private universities account for most of the observations.

#Visualization 9
#Plotting categorical variable - Degree Length
plot(data_univ$degree_length, col = c("pink", "purple"))

This plot shows that 4-year degree programmes outnumber 2-year programmes.

2.4.2 Which demographic characteristics are more pronounced for more expensive universities?

With reference to the visualizations, non-resident students, women and minority groups contribute most at the expensive universities, while the number of native students is very low, Asian students make up 5 to 10%, and Hispanic students are close to or slightly below the Asian share.

2.4.3 Write down your assumptions about why universities with lower costs tend to attract certain groups more than more expensive universities.

We can observe that certain groups are more strongly represented at the lower-cost universities. This might be because: 1. Affording a high tuition fee may be a challenge. 2. The cost of living (room and board) at these universities is within their means.

2.4.4 Which important considerations might these visualization conceal, rather than reveal? Produce visualizations that illustrate the findings that might be unobvious from scatterplots above.

#Visualization 10
#Expensive states - fee components and actual cost
data_univ_viz_high<- data_univ_viz[order(-data_univ_viz$out_of_state_total),][1:10,]
data_univ_viz_high_cost <- data_univ_viz_high  %>%
  select(univ_short_name,room_and_board,in_state_tuition,out_of_state_tuition,net_cost)
#convert to Pivot longer for plot
data_univ_viz_high_PL <- data_univ_viz_high_cost %>%
  pivot_longer(cols=-c("univ_short_name"), names_to = "Index" , values_to = "Value")
ggplot_all_param <- ggplot(data=data_univ_viz_high_PL, aes(x=univ_short_name, y=Value, fill=Index)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = coul)+ #a single fill scale avoids the "Scale for 'fill' is already present" warning
  theme_minimal()
ggplot_all_param + ggtitle("Expensive Universities - 4 Cost Factors")

This graph demonstrates the four cost factors at the expensive universities, which are the main contributing factors in university selection. We can see that in_state_tuition and out_of_state_tuition are far higher than the other cost components.

2.5 Problem 5: Correlation 1

#Plotting a correlation map
data_univ_corr <- data_univ[c(3,5:15,19:22,24:28,23)]
data_univ_corr <- cor(data_univ_corr)
corrplot(data_univ_corr, method = "circle")

2.5.1 Which feature has the strongest correlation to the target feature? Plot the correlation between them exclusively.

It can be observed that the following variables have a strong correlation with the target variable: 1) in_state_tuition 2) in_state_total 3) out_of_state_tuition

#Correlation and scatterplot
data_univ_corr2 <- data_univ[c(13:15,19:23)]
ggpairs(data_univ_corr2, 
        columns = c("in_state_tuition", "in_state_total", "out_of_state_tuition","out_of_state_total"), 
        upper = list(continuous = wrap("cor",  size = 10)), 
        lower = list(continuous = "smooth"))

This illustrates how each cost variable is plotted against the others. We can clearly see a strong positive relationship among all the cost variables.

2.5.2 Describe and interpret what you can see in this correlation plot. Any unusual occurrences?

From the correlation and scatter plots we can see that all the variables follow the same trend. This might be a spurious correlation between the variables; one possible reason is that the residuals exhibit autocorrelation.

2.5.3 Which three features correlate the most with make_world_better_percent, a percentage of alumni who make the world a better place?

From the correlation plot, the three features that correlate most with make_world_better_percent are early_career_pay, mid_career_pay and stem_percent.

2.5.5 Hypothesis Statement 3:

H0 (Null hypothesis) : There is no correlation between stem_percent and make_world_better_percent
H1 (Alternate hypothesis) : There is a correlation between stem_percent and make_world_better_percent

hyp_test_3 <- cor.test(data_univ$stem_percent, data_univ$make_world_better_percent, 
                       method = "spearman", exact = FALSE)
hyp_test_3
## 
##  Spearman's rank correlation rho
## 
## data:  data_univ$stem_percent and data_univ$make_world_better_percent
## S = 1631798, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9451352

The p-value here is < 2.2e-16, which is less than the significance level of 0.05, so we reject H0 and accept H1. With rho of about 0.945, it can be concluded that stem_percent and make_world_better_percent are strongly positively correlated: students who contribute to make_world_better_percent tend to come from STEM programmes.

2.5.6 Hypothesis Statement 4:

H0 (Null hypothesis) : There is no correlation between early_career_pay and make_world_better_percent
H1 (Alternate hypothesis) : There is a correlation between early_career_pay and make_world_better_percent

hyp_test_4 <- cor.test(data_univ$early_career_pay, data_univ$make_world_better_percent, 
                       method = "spearman", exact = FALSE)
hyp_test_4
## 
##  Spearman's rank correlation rho
## 
## data:  data_univ$early_career_pay and data_univ$make_world_better_percent
## S = 1067610, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9641045

The p-value here is < 2.2e-16, which is less than the significance level of 0.05, so we reject H0 and accept H1. With rho of about 0.964, it can be concluded that early_career_pay and make_world_better_percent are strongly positively correlated.

2.5.7 Which hypothesis do you find the most plausible? Which sources are there supporting it?

Hypothesis 3 seems the more plausible. From general and domain knowledge, students who engage more in STEM fields are the ones who contribute more to making the world a better place; for instance, engineering and technology are rapidly evolving fields that contribute greatly to human advancement.

2.5.8 Which features do you lack in this dataset, which would have helped you determine whether your hypothesis is likely true?

Employment or placement data for the alumni could have been provided in the dataset; this would have helped determine whether the hypothesis is likely true.

2.5.9 Explain the difference between Pearson and Spearman correlation coefficients. Which of them have you just used in this problem? Which of them would be more feasible for the analysis you are doing?

Correlation establishes a relationship between variables: it measures the extent to which two variables are related to each other. Correlation can be measured using the following tests:
Pearson correlation coefficients
Spearman correlation coefficients
Kendall correlation

Pearson correlation coefficient:
- It is a parametric test.
- It can be used to find the correlation between two quantitative variables in a population.
- It evaluates the linear relationship between two continuous variables.
- To perform a Pearson correlation, the variables have to be normally distributed.

Spearman correlation coefficient:
- It is a non-parametric test that evaluates a monotonic relationship.
- It can be used for both quantitative and ordinal (rankable qualitative) variables.
- It does not rely on normality of the data.

We use the Spearman correlation here since the data are not normally distributed. Even though Pearson is more widely used, Spearman is more feasible for our dataset as it does not assume normality of the variables.
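The difference can be illustrated with a small sketch on hypothetical data (base R only, not from the project datasets): a perfectly monotonic but non-linear relationship yields a Spearman rho of exactly 1, while Pearson's r stays below 1.

#Hypothetical data: y is a monotonic but non-linear function of x
x <- 1:20
y <- exp(x / 4)
#Pearson measures linear association only
cor(x, y, method = "pearson")  #less than 1, since the relationship is not linear
#Spearman correlates the ranks, so any monotonic relationship scores 1
cor(x, y, method = "spearman") #exactly 1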

2.6 Problem 6: Correlation 2

2.6.1 Explain why you think a combination will be useful before you create them.

Combining features or calculating new variables gives additional data points for prediction and can make the predictors stronger. For example, n_men can be calculated from total_enrollment and n_women.

#Create new attributes by combining different features with each other.
data_univ <- data_univ %>% 
  mutate(n_men = total_enrollment - n_women)%>% 
  mutate(n_native_fee = n_native * in_state_tuition) %>% 
  mutate(n_nonresident_fee = n_nonresident * out_of_state_tuition)

The number of men is calculated from the available data. Other possible new features are n_native_fee and n_nonresident_fee, which might be beneficial for the analysis.

2.6.2 Check their correlation with the out_of_state_total in comparison with the other features from before. Show the correlations with the target feature in a descending order.

cor_data <- data_univ %>%
  select(c(3,5,10,12,14,15,28))
correlation = cor(cor_data,data_univ$out_of_state_tuition, method = "spearman")
names = rownames(correlation)
abs_cor = abs(correlation)
data_cor = data.frame(X_var = names,abs_cor = abs_cor,cor = correlation)
data_cor[order(data_cor$abs_cor,decreasing = TRUE),]

2.6.3 Do your newly created features have higher correlation with the target feature?

The correlation of the target feature with n_women and the newly created n_men shows no significant difference between them; the new feature does not have a strong correlation with the target variable.

2.6.4 Take a look at the data from tuition_income dataset and decide which features or combinations of features you think will also be beneficial for prediction. Repeat steps 1-3 for them. Note that you may need to make some extra transformations to add them.

Note: The tuition_income and universities datasets were already joined in Problem 1, before data preprocessing, analysis and EDA.

#correlation Table - Descending
tution_cor_data <- data_univ %>%
  select(c(3,19:22,24:28))
correlation_tut = cor(tution_cor_data,data_univ$out_of_state_tuition, method = "spearman")
names = rownames(correlation_tut)
abs_cor_tut = abs(correlation_tut)
data_cor_tut = data.frame(X_var = names,abs_cor = abs_cor_tut,cor = correlation_tut)
data_cor_tut[order(data_cor_tut$abs_cor,decreasing = TRUE),]

It can be observed that the newly added feature net_cost has a strong positive correlation with the target variable.

2.6.5 The 10 states you were directed to select for analysis weren’t selected randomly. In fact, they comefrom a certain well-known rating, and comprise top-10, bottom-10, middle 10, second or fourth quintile of it. Find how we selected them, and support your claim with reference to the respective rating.

Ref : https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_GDP_per_capita
The 10 states are the top 10 U.S. states by 2019 per-capita GDP, with one outlier removed: the District of Columbia, the capital city of the United States of America, which is not part of any state and has an extremely high per-capita GDP of $200K+ with just 0.69 million population.

data_univ_gdp <- data_univ %>%
  mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055, 
                                   state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468, 
                                   state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451, 
                                   state == "Maryland" ~ 71838, TRUE ~ 0))
plot_ly(data_univ_gdpmap, z=data_univ_gdpmap$state_gdp2019, locations=data_univ_gdpmap$state_code, text=paste0(data_univ_gdpmap$state_code, '<br>Out_of_state_total $: ', data_univ_gdpmap$out_of_state_total), 
        type="choropleth", locationmode="USA-states", colors = 'pink') %>%
  layout(geo = list(scope="usa")) %>%
  layout(title = 'Top 10 U.S. States by GDP per capita with Out of state total cost', plot_bgcolor = "#e5ecf6")

2.6.6 Explain the reasoning you used to find this missing information

All 10 states have well-reputed and expensive universities, which is related to each state's economy and GDP. Moreover, GDP might be correlated with the target variable out_of_state_total.

2.6.7 Might this statewide score be useful at predicting out-of-state total cost? Explain why or why not. If it is useful, add it to your data and check the correlation.

GDP is a measure of a state's economy and generally has a linear relationship with education and the cost of living, so the GDP score should be a useful predictor of out_of_state_total. state_gdp2019 has been added as a new feature to the training and testing datasets in the subsequent steps for analysis, correlation and modelling.

#Finding correlation for the GDP values with the target variable to decide if it has to be added to your dataset as a new feature.
tution_cor_data <- data_univ_gdp %>%
  select(c(32))
correlation_tut = cor(tution_cor_data,data_univ_gdp$out_of_state_tuition, method = "spearman")
names = rownames(correlation_tut)
abs_cor_tut = abs(correlation_tut)
data_cor_tut = data.frame(X_var = names,abs_cor = abs_cor_tut,cor = correlation_tut)
data_cor_tut[order(data_cor_tut$abs_cor,decreasing = TRUE),]

The variables state_gdp2019 and out_of_state_tuition exhibit a positive correlation of 0.2804, which is modest but still informative.

#Adding GDP to the train dataset
train_data <- train_data %>%
  mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055, 
                                   state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468, 
                                   state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451, 
                                   state == "Maryland" ~ 71838, TRUE ~ 0))

2.6.8 Do any of your new features shine the new light on possible determinants of what makes students feel they are making the world the better place? Explain your insights.

GDP is one predominant factor that enables students to enrol in and receive a high standard of education from top-ranked universities, producing highly qualified scholars. It also appears to be directly proportional to stem_percent.

2.7 Problem 7: Data cleaning

Note: Missing values were already handled in problem 3 (EDA and Data preprocessing), since it is a good approach to clean the data before performing EDA and visualization. This will help for better interpretation and getting more insights from the data.

2.7.1 Find out which variables contain missing values. If there are any, how many values are missing?

Missing values:
Missing values are observations with no recorded data. Processing the data without handling them can degrade statistical tests and machine learning models, and can introduce bias that leads to inaccurate results. The following variables contained NAs, which were treated using mean imputation or by filling with zero:
room_and_board (Stratified Mean imputation)
early_career_pay (Replaced NA with 0)
mid_career_pay (Replaced NA with 0)
make_world_better_percent (Replaced NA with 0)
stem_percent (Replaced NA with 0)

Missing values are of four types:
1. Missing completely at random (MCAR): the missingness is unrelated to both the observed and the unobserved values.
2. Missing at random (MAR): the missingness is related to the observed data, but not to the missing values themselves.
3. Missing not at random (MNAR): the missingness is related to the unobserved (missing) values themselves.
4. Structurally missing data: there is a logical reason behind the non-existence of the data.

2.7.2 Which approaches exist for handling missing data? Describe two approaches of how to handle them. Write one advantage and one disadvantage for each of those methods.

There are several approaches to handle missing data:
1. Deletion methods to remove missing values.
2. Data imputation techniques.
3. Regression-based imputation, which predicts missing values from the other variables.

  1. Deletion methods:
    This approach drops a column entirely when it has more than 30% missing values. A general rule of thumb permits 25 to 30% missingness; beyond that, it is advisable to discard the variable as insignificant.
    Advantage:
  • Simple and quick approach.
    Dis-advantage:
  • We can lose information that might be a significant predictor.
  • Underestimates variances and covariances, as it does not consider the uncertainty in the missing data.
  2. Data imputation:
    There are several approaches to data imputation: missing values can be replaced with the most frequently occurring value in a column, or with the mean/median of the observed values.
  • Average imputation: uses the average of the responses from the other data entries to fill in missing observations.
  • Common-point imputation: a middle point or the most commonly chosen value is used to fill in missing observations.
    Advantage:
  • No information is lost.
  • Results can be easily interpreted.
    Dis-advantage:
  • It can artificially reduce the variability of the dataset.
  • Mean/median imputation is valid only for numerical data, not categorical data.
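As a minimal sketch on a hypothetical vector (not the project data), the two imputation variants described above can be written as:

#Hypothetical column with missing values
v <- c(12, 15, NA, 20, NA, 18)
#Average (mean) imputation
v_mean <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)     #NAs become 16.25
#Common-point (median) imputation
v_median <- ifelse(is.na(v), median(v, na.rm = TRUE), v) #NAs become 16.5

Both keep all rows, but note how they shrink the spread of the imputed column compared with deletion.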

2.7.3 Handle the missing data by methods you find the most suitable. Explain how you chose the method for each column.

Treating missing values for room_and_board: while analysing the missing values, it was found that they are structurally missing (there is a pattern in the missing values), so they can be filled with a stratified mean (the mean within each state).

Treating missing values for early_career_pay, mid_career_pay, make_world_better_percent and stem_percent: the missing values here also fall under structurally missing data. Judging from the university names, some institutions appear to be arts and management universities with no engineering and technology programmes, which contributes zero towards stem_percent; this indirectly explains why the related pay and survey values are missing as well. Since more than 80% of the data is missing for these columns, one option would be to drop the variables; however, they can be strong predictors, so the better approach is to fill them with zero.

2.7.4 Handle the categorical data with one-hot-encoding

One-hot encoding is an approach to handling categorical variables: it creates a new binary (0/1) column for each possible category of the variable.

#define one-hot encoding function
#Selecting only required variables for Modelling ML:
train_data <- train_data %>% 
  select(-c (X,name,state, state_code,in_state_tuition,in_state_total,out_of_state_tuition, net_cost))
colnames(train_data)
##  [1] "total_enrollment"          "n_native"                 
##  [3] "n_asian"                   "n_black"                  
##  [5] "n_hispanic"                "n_pacific"                
##  [7] "n_nonresident"             "n_total_minority"         
##  [9] "n_multiracial"             "n_unknown"                
## [11] "n_white"                   "n_women"                  
## [13] "type"                      "degree_length"            
## [15] "room_and_board"            "out_of_state_total"       
## [17] "early_career_pay"          "mid_career_pay"           
## [19] "make_world_better_percent" "stem_percent"             
## [21] "state_gdp2019"
#Scaling
train_data[c(1:12,15,17:21)] <- scale(train_data[c(1:12,15,17:21)])
#One-hot encoding
encoding_train <- dummyVars(~ ., data = train_data, fullRank = TRUE)
train_data <- data.frame(predict(encoding_train, newdata = train_data))

2.8 Problem 8: Feature scaling

2.8.1 What feature scaling methods exist? Name five methods and write a short description for three of them. Pick one to use for this data set.

Feature scaling is a method to normalise the range of the features, and it plays a vital role in feature engineering.
There are several feature scaling methods. Some of them are:
1. Normalization: rescales values into a fixed range so that no single feature dominates because of its units.
2. Standardization: centres each feature at mean 0 with standard deviation 1 (z-scores).
3. Absolute Maximum Scaling: divides each value by the feature's maximum absolute value, mapping it into [-1, 1].
4. Min-Max Scaling
5. Robust Scaling

We have selected standardization for this dataset, applied via R's scale() in the preprocessing steps: the summary statistics show wide differences in the ranges of the variables, and standardization brings every variable onto a comparable scale (mean 0, standard deviation 1). Min-max scaling, which maps values into the [0, 1] range, would be an alternative, but it is more sensitive to outliers because the extremes define the range.
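Both scaling approaches can be sketched on a hypothetical vector (the project code itself applies scale() to the selected columns):

#Hypothetical feature with a wide range
fee <- c(5000, 12000, 30000, 55000)
#Min-max scaling: maps values into [0, 1]
minmax <- (fee - min(fee)) / (max(fee) - min(fee))
#Standardization (what scale() does): mean 0, standard deviation 1
standardized <- as.numeric(scale(fee))
minmax       #ranges from 0 to 1
standardized #centred around 0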

2.8.2 Find the feature importance with a quick Random Forest and show them in a plot. What insights do you get out of it?

set.seed(4543)
data_imp_rf <- randomForest(out_of_state_total ~ ., data=train_data, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
varImpPlot(data_imp_rf)

From the plot above, %IncMSE is the stronger and more informative measure: it reports the increase in prediction MSE when a predictor's values are randomly permuted (shuffled), so the higher the number, the more important the variable. room_and_board, degree_length.4.Year and type.Private are the three factors with the highest importance, and the %IncMSE ranking largely agrees with IncNodePurity.

2.9 Problem 9: Test data

2.9.1 Perform the same data cleaning steps from questions 7 and 8 for the test data set. (Replace missing values, handle the categorical data, add the new features, and scale the features)

Here we perform the same data cleaning steps that were applied to the train dataset.

# Data cleaning for test dataset

#Step1:
#Adding GDP to the test dataset
test_data <- test_data %>%
  mutate(state_gdp2019 = case_when(state == "New York" ~ 90043, state == "Massachusetts" ~ 86942, state == "Connecticut" ~ 81055, 
                                   state == "California" ~ 80563, state == "Washington" ~ 80170, state == "Delaware" ~ 78468, 
                                   state == "Alaska" ~ 76220, state == "North Dakota" ~ 75321, state == "New Jersey" ~ 73451, 
                                   state == "Maryland" ~ 71838, TRUE ~ 0))
#Step2: Handling missing values:
apply(is.na(test_data),2, sum)
##                         X                      name          total_enrollment 
##                         0                         0                         0 
##                     state                  n_native                   n_asian 
##                         0                         0                         0 
##                   n_black                n_hispanic                 n_pacific 
##                         0                         0                         0 
##             n_nonresident          n_total_minority             n_multiracial 
##                         0                         0                         0 
##                 n_unknown                   n_white                   n_women 
##                         0                         0                         0 
##                state_code                      type             degree_length 
##                         0                         0                         0 
##            room_and_board          in_state_tuition            in_state_total 
##                        65                         0                         0 
##      out_of_state_tuition        out_of_state_total          early_career_pay 
##                         0                         0                       130 
##            mid_career_pay make_world_better_percent              stem_percent 
##                       130                       130                       130 
##                  net_cost             state_gdp2019 
##                        19                         0
#Filling missing value for room_and_board with stratified mean imputer
test_data <- test_data %>% group_by(state_code) %>%
  mutate(room_and_board=ifelse(is.na(room_and_board),mean(room_and_board,na.rm=TRUE),room_and_board))%>%
  group_by() #removing groupby
#Filling missing value for net_cost with Mean
test_data <- test_data %>% mutate_at(28, mean_imputer) #for net_cost

#Filling missing value with zero for early_career_pay, mid_career_pay, make_world_better_percent, stem_percent 
test_data[is.na(test_data)] <- 0 #early_career_pay, mid_career_pay, make_world_better_percent, stem_percent 
apply(is.na(test_data),2, sum)
##                         X                      name          total_enrollment 
##                         0                         0                         0 
##                     state                  n_native                   n_asian 
##                         0                         0                         0 
##                   n_black                n_hispanic                 n_pacific 
##                         0                         0                         0 
##             n_nonresident          n_total_minority             n_multiracial 
##                         0                         0                         0 
##                 n_unknown                   n_white                   n_women 
##                         0                         0                         0 
##                state_code                      type             degree_length 
##                         0                         0                         0 
##            room_and_board          in_state_tuition            in_state_total 
##                         0                         0                         0 
##      out_of_state_tuition        out_of_state_total          early_career_pay 
##                         0                         0                         0 
##            mid_career_pay make_world_better_percent              stem_percent 
##                         0                         0                         0 
##                  net_cost             state_gdp2019 
##                         0                         0
#Step3: Selecting only required variables for Modelling ML:
test_data<- test_data%>% 
select(-c (X,name,state, state_code,in_state_tuition,in_state_total,out_of_state_tuition, net_cost))
colnames(train_data)
##  [1] "total_enrollment"          "n_native"                 
##  [3] "n_asian"                   "n_black"                  
##  [5] "n_hispanic"                "n_pacific"                
##  [7] "n_nonresident"             "n_total_minority"         
##  [9] "n_multiracial"             "n_unknown"                
## [11] "n_white"                   "n_women"                  
## [13] "type.Private"              "type.Public"              
## [15] "degree_length.4.Year"      "room_and_board"           
## [17] "out_of_state_total"        "early_career_pay"         
## [19] "mid_career_pay"            "make_world_better_percent"
## [21] "stem_percent"              "state_gdp2019"
colnames(test_data)
##  [1] "total_enrollment"          "n_native"                 
##  [3] "n_asian"                   "n_black"                  
##  [5] "n_hispanic"                "n_pacific"                
##  [7] "n_nonresident"             "n_total_minority"         
##  [9] "n_multiracial"             "n_unknown"                
## [11] "n_white"                   "n_women"                  
## [13] "type"                      "degree_length"            
## [15] "room_and_board"            "out_of_state_total"       
## [17] "early_career_pay"          "mid_career_pay"           
## [19] "make_world_better_percent" "stem_percent"             
## [21] "state_gdp2019"
#Step4: Checking Near Zero Variance for the test dataset
test_data_nzv <- nearZeroVar(test_data)
print(test_data_nzv)
## integer(0)
#Step5: Scaling the test data
test_data[c(1:12,15,17:21)] <- scale(test_data[c(1:12,15,17:21)])

#Step6: Perform one hot encoding
encoding_test  <- dummyVars(~ ., data = test_data , fullRank = TRUE)
test_data  <- data.frame(predict(encoding_test , newdata = test_data ))

2.10 Problem 10: Regression

2.10.1 Select and train the following regression models on the training set. Linear model, support vector regression, and random forest.

#Linear model:
set.seed(123)
fitControl <- trainControl(method = "none")
lm_reg_model <- train(out_of_state_total ~., 
                       data = train_data,
                       method = "lm",
                       trControl = fitControl)
pred_lm_reg_model = predict(lm_reg_model, test_data)
## Warning in stats::predict.lm(object, ...): prediction from a rank-deficient fit
## may be misleading
summary(lm_reg_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34566  -3922   -464   5173  24586 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                15560.7     2945.6   5.283 2.16e-07 ***
## total_enrollment            4438.6     4357.9   1.019   0.3091    
## n_native                    1323.6      669.1   1.978   0.0486 *  
## n_asian                    -4466.0     4845.5  -0.922   0.3573    
## n_black                    -4713.6     4321.7  -1.091   0.2761    
## n_hispanic                 -8950.8     9982.8  -0.897   0.3705    
## n_pacific                  -1800.0      909.7  -1.979   0.0486 *  
## n_nonresident               1321.0      991.3   1.333   0.1835    
## n_total_minority           11070.1    16353.2   0.677   0.4989    
## n_multiracial                   NA         NA      NA       NA    
## n_unknown                   -571.2      788.4  -0.725   0.4692    
## n_white                         NA         NA      NA       NA    
## n_women                    -1438.1     3769.7  -0.381   0.7031    
## type.Private               13142.8     2771.7   4.742 3.01e-06 ***
## type.Public                 -315.5     2988.2  -0.106   0.9160    
## degree_length.4.Year       14518.9     1812.3   8.011 1.45e-14 ***
## room_and_board              7067.7      564.9  12.511  < 2e-16 ***
## early_career_pay          -14868.5    14639.0  -1.016   0.3104    
## mid_career_pay             20610.4    14435.5   1.428   0.1542    
## make_world_better_percent  -1561.5     1698.4  -0.919   0.3585    
## stem_percent                 -56.7      880.4  -0.064   0.9487    
## state_gdp2019                921.5      573.3   1.607   0.1088    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9749 on 375 degrees of freedom
## Multiple R-squared:  0.8037, Adjusted R-squared:  0.7938 
## F-statistic: 80.83 on 19 and 375 DF,  p-value: < 2.2e-16
#support vector regression:
set.seed(123)
svm_reg_model = svm(out_of_state_total ~ ., data = train_data)
print(svm_reg_model)
## 
## Call:
## svm(formula = out_of_state_total ~ ., data = train_data)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.04761905 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  260
pred_svm_reg_model = predict(svm_reg_model, test_data)
summary(pred_svm_reg_model)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5686   12923   28139   30687   48779   69167
#Random forest (note: caret's method = "rpart" below fits a single CART tree; method = "rf" would fit a full random forest):
set.seed(123)
randomcontrol_reg <- trainControl(
  method = "repeatedcv", 
  number = 4, 
  repeats = 5,
  search = "random")
set.seed(123)
rf_reg_model <- train(out_of_state_total ~ .,  data = train_data,
                        method = "rpart",
                        trControl = randomcontrol_reg,
                        tuneLength = 4)
rf_reg_model
## CART 
## 
## 395 samples
##  21 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 5 times) 
## Summary of sample sizes: 297, 296, 295, 297, 296, 297, ... 
## Resampling results across tuning parameters:
## 
##   cp            RMSE      Rsquared   MAE     
##   8.346853e-06  9568.980  0.8063095  6156.471
##   1.491750e-04  9566.808  0.8063740  6136.795
##   1.827085e-03  9575.562  0.8056733  6165.172
##   1.049060e-02  9868.411  0.7924379  6781.708
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.000149175.
pred_rf_reg_model <- predict(rf_reg_model, test_data)

2.10.2 Evaluate the three regression models on the test set. Which model performs best?

#Creating results matrix (3 rows = models, 1 column = RMSE)
table <- matrix(0, 3, 1)
colnames(table) <- c("RMSE")
row.names(table) <- c("Linear_Regression","SVM","Random Forest")
## Computing RMSE
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}
rmse_lm_reg <- rmse(test_data$out_of_state_total, pred_lm_reg_model)
rmse_svm_reg <- rmse(test_data$out_of_state_total, pred_svm_reg_model)
rmse_ranfor_reg <- rmse(test_data$out_of_state_total, pred_rf_reg_model)
table[1,1] <- rmse_lm_reg
table[2,1] <- rmse_svm_reg
table[3,1] <- rmse_ranfor_reg
table <- as.data.frame(table)
print(table)
##                        RMSE
## Linear_Regression 10164.355
## SVM                8966.781
## Random Forest      9981.496

RMSE (Root Mean Square Error) measures the standard deviation of the residuals, i.e., the typical size of the prediction errors. It also indicates how concentrated the data are around the line of best fit in a linear model. Here the Support Vector Machine has the lowest RMSE, which means it makes better predictions on this test set than the other two models.

2.10.3 Explain one approach about how the models can be further optimized.

One robust approach to optimizing the models is hyperparameter tuning with cross-validation. Cross-validation is an effective resampling technique that guards against over-fitting and gives a more reliable estimate of model accuracy, so hyperparameters (e.g., cost and gamma for the SVM, or cp for the tree) can be chosen based on cross-validated error rather than a single train/test split.
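As a minimal sketch of this idea (using the built-in mtcars data as a stand-in for train_data), e1071's tune() runs 10-fold cross-validation over a small grid of cost and gamma values and keeps the pair with the lowest cross-validated error:

```r
library(e1071)

set.seed(123)
# tune() cross-validates every cost/gamma combination in the grid
# and retains the pair with the lowest CV error.
tuned <- tune(svm, mpg ~ ., data = mtcars,
              ranges = list(cost = 2^(0:3), gamma = c(0.01, 0.05, 0.1)))
print(tuned$best.parameters)
best_svm <- tuned$best.model   # refitted on the full data with the best pair
```

The same pattern applies to the regression model above by replacing mpg ~ . and mtcars with out_of_state_total ~ . and train_data.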

2.10.4 Based on the data you have at hand and your background knowledge, which problems might you encounter when trying to use your best model to predict out-of-state total cost for the next year?

Potential problems include:
1. A pandemic or natural disaster that changes enrollment patterns and costs.
2. Government regulations, such as changes to visa and approval policies.

2.10.5 Name and explain two accuracy metrics for regression other than MSE or RMSE. Are any of them better suited for evaluating your models, given the goal you have formulated earlier?

AIC (Akaike’s Information Criterion) and BIC (Bayesian Information Criterion) are effective metrics for regression models.
AIC: It adds a penalty for every additional feature included in the model, so a feature is rewarded only if it improves the fit enough to offset the penalty. The lower the AIC, the better the model.
BIC: Similar to AIC, but with a stronger complexity penalty that also grows with the sample size.
Because they balance goodness of fit against model complexity, AIC and BIC are well suited here for comparing linear models with different feature sets; unlike training-set RMSE, they do not automatically favor the larger model.
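As an illustration (on the built-in mtcars data rather than the university data), both criteria are available directly via stats::AIC() and stats::BIC() for any lm fit:

```r
# Compare a smaller and a larger linear model; lower AIC/BIC is better.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big   <- lm(mpg ~ wt + hp + disp, data = mtcars)
c(AIC_small = AIC(fit_small), AIC_big = AIC(fit_big))
c(BIC_small = BIC(fit_small), BIC_big = BIC(fit_big))
# For the same model, BIC exceeds AIC here because its per-parameter
# penalty log(n) is larger than AIC's penalty of 2 whenever n > 7.
```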

2.10.6 What has to change in real life to completely invalidate the results you get and accuracy of your model? I.e., what laws, natural events, societal changes have to happen to make conclusions based on this dataset inadmissible for further decision-making?

Some of the real-life issues could be:
1. A COVID-like pandemic impacting the entire world.
2. Government policy changes, such as visa-related restrictions.
3. An economic downturn in developing countries, which might affect affordability for international students.
4. Policy changes such as free education or generous scholarships in other developed countries, which would attract students toward those opportunities.

2.11 Problem 11: Classification

2.11.1 Categorize the target variable (out_of_state_total) into five categories and build a classification model for the above pre-processed data.

#Categorization:
train_data_classifier <- mutate(train_data,
  out_of_state_total = case_when(
    out_of_state_total >= 0    & out_of_state_total <= 15000 ~ 1,
    out_of_state_total > 15000 & out_of_state_total <= 30000 ~ 2,
    out_of_state_total > 30000 & out_of_state_total <= 45000 ~ 3,
    out_of_state_total > 45000 & out_of_state_total <= 60000 ~ 4,
    out_of_state_total > 60000                               ~ 5))
test_data_classifier <- mutate(test_data,
  out_of_state_total = case_when(
    out_of_state_total >= 0    & out_of_state_total <= 15000 ~ 1,
    out_of_state_total > 15000 & out_of_state_total <= 30000 ~ 2,
    out_of_state_total > 30000 & out_of_state_total <= 45000 ~ 3,
    out_of_state_total > 45000 & out_of_state_total <= 60000 ~ 4,
    out_of_state_total > 60000                               ~ 5))
# Encoding the target feature as a factor
train_data_classifier$out_of_state_total <- factor(train_data_classifier$out_of_state_total, levels = c(1,2,3,4,5))
test_data_classifier$out_of_state_total <- factor(test_data_classifier$out_of_state_total, levels = c(1,2,3,4,5))
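The same five-way binning can also be written compactly with base R's cut(), which returns a factor directly; a small check with hypothetical cost values:

```r
# Right-closed intervals [0,15000], (15000,30000], ..., (60000, Inf)
# match the case_when() thresholds above.
x <- c(12000, 15000, 28000, 45001, 61000)
bins <- cut(x, breaks = c(0, 15000, 30000, 45000, 60000, Inf),
            labels = 1:5, include.lowest = TRUE)
bins
## [1] 1 1 2 4 5
## Levels: 1 2 3 4 5
```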

2.11.2 Train the following classification models on the training set for classification and evaluate the models on the test set: SVM, k-NN, and Random Forest.

#SVM:
# Fitting SVM to the Training set
set.seed(123)
svm_class_model = svm(formula = out_of_state_total ~ .,
                 data = train_data_classifier,
                 type = 'C-classification',
                 kernel = 'linear')
# Predicting the Test set results
pred_svm_class_model = predict(svm_class_model, newdata = test_data_classifier[-17])
# Making the Confusion Matrix
pred_svm_class_model_conf_matrix = table(test_data_classifier[, 17], pred_svm_class_model)
#Accuracy
accuracy <- function(x){ sum(diag(x)) / sum(x) * 100 }
accuracy_svm <- accuracy(pred_svm_class_model_conf_matrix)
#KNN:
set.seed(123)
knn_class_model <- knn(train = train_data_classifier,
                      test = test_data_classifier,
                      cl = train_data_classifier$out_of_state_total,
                      k = 3)
# Confusion Matrix
knn_class_model_conf_matrix <- table(test_data_classifier$out_of_state_total, knn_class_model)
#Model Evaluation - Choosing K
misClassError <- mean(knn_class_model != test_data_classifier$out_of_state_total)
#Random Forest:
# Create a Random Forest model with default parameters
set.seed(123)
rf_class_model <- randomForest(out_of_state_total~., data=train_data_classifier, proximity=TRUE) 
pred_rf_class_model <- predict(rf_class_model, test_data_classifier)
rf_class_model_conf_matrix <- confusionMatrix(pred_rf_class_model, test_data_classifier$out_of_state_total)
plot(rf_class_model)

#2. Create a confusion matrix for each model. Interpret the results and compare them with each other.

#Confusion Matrix for SVM:
pred_svm_class_model_conf_matrix
##    pred_svm_class_model
##      1  2  3  4  5
##   1 58  4  3  1  0
##   2 11 12  7  0  0
##   3  2  2 13  7  0
##   4  0  0  4 14  0
##   5  0  0  0 12 18
#Accuracy Score:
accuracy_svm
## [1] 68.45238

SVM provides an accuracy of about 68%, which is relatively low compared to KNN below.

#Confusion Matrix for KNN:
knn_class_model_conf_matrix
##    knn_class_model
##      1  2  3  4  5
##   1 65  0  1  0  0
##   2  9 19  2  0  0
##   3  1  0 22  1  0
##   4  0  0  0 18  0
##   5  0  0  0  2 28
#Accuracy score:
print(paste('Accuracy =', 1-misClassError))
## [1] "Accuracy = 0.904761904761905"

KNN provides an accuracy of about 90%, the highest among the three models. Note, however, that the knn() call above passes the entire data frames (including the target column) as train and test, so the class label itself enters the distance calculation; passing only the predictor columns would give a fairer estimate.

#Confusion Matrix for Random Forest:
rf_class_model_conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5
##          1 57 18  3  0  0
##          2  6  9  2  1  0
##          3  2  3 12  1  0
##          4  1  0  5 11  5
##          5  0  0  2  5 25
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6786          
##                  95% CI : (0.6023, 0.7484)
##     No Information Rate : 0.3929          
##     P-Value [Acc > NIR] : 6.75e-14        
##                                           
##                   Kappa : 0.5628          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.8636  0.30000  0.50000  0.61111   0.8333
## Specificity            0.7941  0.93478  0.95833  0.92667   0.9493
## Pos Pred Value         0.7308  0.50000  0.66667  0.50000   0.7812
## Neg Pred Value         0.9000  0.86000  0.92000  0.95205   0.9632
## Prevalence             0.3929  0.17857  0.14286  0.10714   0.1786
## Detection Rate         0.3393  0.05357  0.07143  0.06548   0.1488
## Detection Prevalence   0.4643  0.10714  0.10714  0.13095   0.1905
## Balanced Accuracy      0.8289  0.61739  0.72917  0.76889   0.8913

Random Forest provides an accuracy of about 68%, comparable to SVM but considerably lower than KNN.

2.11.3 Which model performs best based on which evaluation method?

KNN performs best, with an accuracy of about 90%. This is based on the accuracy score computed from the test-set predictions. There is scope for improving the score further by tuning the value of k.
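A sketch of choosing k by held-out error, shown on the built-in iris data as a stand-in for the classifier data above (as with knn(), the predictors must be numeric, and only the predictor columns are passed as train/test):

```r
library(class)

set.seed(123)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]           # numeric predictor columns only
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]        # training labels
truth <- iris$Species[-idx]
# Misclassification rate on the held-out rows for each candidate k
errs   <- sapply(1:15, function(k) mean(knn(train, test, cl, k = k) != truth))
best_k <- which.min(errs)
```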

2.11.4 Explain the evaluation method you used.

The evaluation methods used here are the confusion matrix and the accuracy score. Confusion matrix: a table that describes classification performance by cross-tabulating predicted against actual classes; it is also known as an error matrix. Accuracy score: the ratio of correct predictions to total predictions made; the higher the accuracy, the more effective and robust the model.
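A toy example of both, with hypothetical labels: the confusion matrix cross-tabulates actual against predicted classes, and accuracy is the share of the diagonal (correct) counts:

```r
actual    <- factor(c(1, 1, 2, 2, 3, 3, 3, 1))
predicted <- factor(c(1, 2, 2, 2, 3, 1, 3, 1))
cm  <- table(actual, predicted)       # rows = actual, columns = predicted
acc <- sum(diag(cm)) / sum(cm) * 100  # diagonal entries are correct predictions
acc
## [1] 75
```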

2.11.5 For which applications would classification into these artificial categories be more useful than regression? Name at least two.

  1. Spam/not-spam email classification
  2. Customer review classification (e.g., positive/negative) in retail applications such as Amazon or Zalando
  3. Movie rating categories (e.g., star buckets)

2.11.6 For which applications would regression on original values be more useful? Name at least two.

  1. Real estate price predictions
  2. Stock market predictions